Deep mutationally scanned (DMS) CHIKV E3/E2 virus library maps viral amino acid preferences and predicts viral escape mutants of neutralizing CHIKV antibodies

Authors
Affiliation

Megan M. Stumpf

University of Colorado - Anschutz Medical Campus

Tonya Brunetti

Bennett J. Davenport

Mary K. McCarthy

Thomas E. Morrison

Published

March 4, 2025

Introduction

This document describes the analysis pipeline and software inputs used to analyze sequencing data generated in the following manuscript (currently available as a preprint):

Reference

M. Stumpf M, Brunetti T, J. Davenport B, K. McCarthy M, E. Morrison T. 2025 (in press). Deep mutationally scanned (DMS) CHIKV E3/E2 virus library maps viral amino acid preferences and predicts viral escape mutants of neutralizing CHIKV antibodies. J. Virol.

1 Software Requirements

For this analysis:

2 Analysis Pipeline

Diagram 1. Analysis Pipeline Workflow

%%{init: {"theme": "neutral"}}%%

flowchart TB
  A{"Raw FASTQ<br>Files"}
  A --> B["Trim Reads<br><code>CutAdapt</code>"]
  A -.-> C("QC Check<br><code>FastQC</code><br><code>MultiQC</code>")
  B -.-> D("QC Check<br><code>FastQC</code><br><code>MultiQC</code>")
  B --> E["Subsample<br>Reads<br><code>seqtk</code>"]
  E -.-> F("QC Check<br><code>FastQC</code><br><code>MultiQC</code>")
  E --> G["Alignment/<br>Codon Inference"<br><code>VirVarSeq</code>]
  G --> H["Formatting for<br><code>batchdiffsel</code>"]
  G --> I["Calculate<br><code>ndet</code>"]
  G -.-> J("QC Check")
  I --> K["Visualization<br><code>dms-viz</code>"]
  H --> L["Differential<br>Selection<br><code>batchdiffsel</code>"]
  L --> K
  L --> M["Visualization<br><code>dmslogo</code>"]
  I --> N["Visualization<br><code>megaLogo</code>"]

3 Sample Preparation

Brief Methods: (For full detailed methods, see Manuscript)

  • Viral supernatants were treated with RNase ONE to remove non-virion associated RNAs.
  • RNA was isolated from treated supernatants using Trizol and treated with DNase H on-column and aliquoted into 3 x 10 uL aliquots and frozen at -80°C.
  • RNA aliquots were thawed and reverse transcribed using SuperScript IV following manufacturer recommendations.
  • Full cDNA reaction volume served as a template for an amplicon PCR reaction to amplify the mutagenized CHIKV p62 region for sequencing.
  • PCR products were PCR purified and eluted in molecular grade water.

4 Library Preparation

Completed at the CU Anschutz Genomics Shared Resource

  • Libraries were mechanically sheared via Covaris
  • Fragments were barcoded using Ovation Ultralow System V2 (Tecan)
  • Barcoded libraries were batched and loaded onto an Illumina NovaSeq 6000
  • Each sample was sequenced with 2x150bp reads at a depth of 25M paired-end reads (50M pairs)
  • Samples were demultiplexed and RAW fastq files delivered

5 Read Preparation and Analysis

Overview of code breakdown and purpose(s) for execution of each code block.

5.1 FastQC/MultiQC/CutAdapt

Checking quality metrics and removing adapter sequences prior to aligning to the CHIKV reference genome.

  • Raw fastq files were analyzed for QC metrics using FastQC and summary reports generated via MultiQC
  • Samples were trimmed of Illumina universal adapters via cutadapt
  • QC metrics were evaluated again with FastQC and MultiQC
  • Trimmed samples were subsampled to 25M reads per pair via seqtk
  • QC metrics were evaluated once more with FastQC and MultiQC
Note

Samples were prepared and sequenced across 3 separate runs (12/12/2022, 06/13/2023, and 08/01/2023) depending on the sample type and code was executed independently for each run on individual seeds.

All three runs have been combined within code blocks for brevity but each contain their own wtDNA control for error-correction in downstream analyses.

To clean and prep all reads, the following shell scripts were run:

Expand to View Code Blocks
step1a_raws_qc.sh
#!/bin/bash

# Run fastqc to check raw read sequencing quality for each run
~/software/FastQC/fastqc --outdir ./12122022/raw_fastqc_reports/ --threads 1 ./12122022/raw_fastq_files/*.fastq.gz
~/software/FastQC/fastqc --outdir ./06132023/raw_fastqc_reports/ --threads 1 ./06132023/raw_fastq_files/*.fastq.gz
~/software/FastQC/fastqc --outdir ./08012023/raw_fastqc_reports/ --threads 9 ./08012023/raw_fastq_files/*.fastq.gz

# Combine raw fastqc files into multiqc report for each run
multiqc ./12122022/raw_fastqc_reports/ --filename raw_multiqc_report.html --outdir ./12122022/multiqc_reports/
multiqc ./06132023/raw_fastqc_reports/ --filename raw_multiqc_report.html --outdir ./06132023/multiqc_reports/
multiqc ./08012023/raw_fastqc_reports/ --filename raw_multiqc_report.html --outdir ./08012023/multiqc_reports/
step1b_trimming.sh
#!/bin/bash

# Trim reads using cutadapt for all samples for each run
### 12122022
## wtDNA
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./12122022/trimmed_fastq_files/tr_wtDNA_S17_L003_R1_001.fastq -p ./12122022/trimmed_fastq_files/tr_wtDNA_S17_L003_R2_001.fastq wtDNA_S17_L003_R1_001.fastq.gz wtDNA_S17_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=wtDNA.cutadapt.json

## mutDNA
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./12122022/trimmed_fastq_files/tr_mutDNA_lib1_S18_L003_R1_001.fastq -p ./12122022/trimmed_fastq_files/tr_mutDNA_lib1_S18_L003_R2_001.fastq mutDNA_lib1_S18_L003_R1_001.fastq.gz mutDNA_lib1_S18_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutDNA.cutadapt.json

### 06132023
## wtDNA
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./06132023/trimmed_fastq_files/tr_wtDNA_S25_L002_R1_001.fastq -p ./06132023/trimmed_fastq_files/tr_wtDNA_S25_L002_R2_001.fastq wtDNA_S25_L002_R1_001.fastq.gz wtDNA_S25_L002_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=wtDNA.cutadapt.json

## mutVirus_p1
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p1_S26_L002_R1_001.fastq -p ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p1_S26_L002_R2_001.fastq mutVirus_lib1_p1_S26_L002_R1_001.fastq.gz mutVirus_lib1_p1_S26_L002_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.p1.cutadapt.json

## mutVirus_p2
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p2_S27_L002_R1_001.fastq -p ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p2_S27_L002_R2_001.fastq mutVirus_lib1_p2_S27_L002_R1_001.fastq.gz mutVirus_lib1_p2_S27_L002_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.p2.cutadapt.json

### 08012023
## wtDNA
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_wtDNA_S73_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_wtDNA_S73_L003_R2_001.fastq wtDNA_S73_L003_R1_001.fastq.gz wtDNA_S73_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=wtDNA.cutadapt.json

## mutVirus_CHK11_1-1
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_1_1_S80_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_1_1_S80_L003_R2_001.fastq mutVirus_CHK11_1_1_S80_L003_R1_001.fastq.gz mutVirus_CHK11_1_1_S80_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK11_1-1.cutadapt.json

## mutVirus_CHK11_3-1
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_3_1_S81_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_3_1_S81_L003_R2_001.fastq mutVirus_CHK11_3_1_S81_L003_R1_001.fastq.gz mutVirus_CHK11_3_1_S81_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK11_3-1.cutadapt.json

## mutVirus_CHK152_1-2
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_1_2_S76_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_1_2_S76_L003_R2_001.fastq mutVirus_CHK152_1_2_S76_L003_R1_001.fastq.gz mutVirus_CHK152_1_2_S76_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK152_1-2.cutadapt.json

## mutVirus_CHK152_2-1
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_2_1_S77_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_2_1_S77_L003_R2_001.fastq mutVirus_CHK152_2_1_S77_L003_R1_001.fastq.gz mutVirus_CHK152_2_1_S77_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK152_2-1.cutadapt.json

## mutVirus_CHK265_1-2
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_1_2_S78_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_1_2_S78_L003_R2_001.fastq mutVirus_CHK265_1_2_S78_L003_R1_001.fastq.gz mutVirus_CHK265_1_2_S78_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK265_1-2.cutadapt.json

## mutVirus_CHK265_2-3
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_2_3_S79_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_2_3_S79_L003_R2_001.fastq mutVirus_CHK265_2_3_S79_L003_R1_001.fastq.gz mutVirus_CHK265_2_3_S79_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.CHK265_2-3.cutadapt.json

## mutVirus_only_1-2
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_only_1_2_S74_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_only_1_2_S74_L003_R2_001.fastq mutVirus_only_1_2_S74_L003_R1_001.fastq.gz mutVirus_only_1_2_S74_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.only_1-2.cutadapt.json

## mutVirus_only_2-1
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ./08012023/trimmed_fastq_files/tr_mutVirus_only_2_1_S75_L003_R1_001.fastq -p ./08012023/trimmed_fastq_files/tr_mutVirus_only_2_1_S75_L003_R2_001.fastq mutVirus_only_2_1_S75_L003_R1_001.fastq.gz mutVirus_only_2_1_S75_L003_R2_001.fastq.gz --nextseq-trim=20 -m 10 --json=mutVirus.only_2-1.cutadapt.json
step1c_trimmed_qc.sh
#!/bin/bash

# Recheck quality of all trimmed samples for each run
~/software/FastQC/fastqc --outdir ./12122022/trimmed_fastqc_reports/ --threads 9 ./12122022/trimmed_fastqc_files/*.fastq
~/software/FastQC/fastqc --outdir ./06132023/trimmed_fastqc_reports/ --threads 2 ./06132023/trimmed_fastqc_files/*.fastq
~/software/FastQC/fastqc --outdir ./08012023/trimmed_fastqc_reports/ --threads 9 ./08012023/trimmed_fastqc_files/*.fastq

# Combine trimmed fastqc files into multiqc report for each run
multiqc ./12122022/trimmed_fastqc_reports/ --filename trimmed_multiqc_report.html --outdir ./12122022/multiqc_reports/
multiqc ./06132023/trimmed_fastqc_reports/ --filename trimmed_multiqc_report.html --outdir ./06132023/multiqc_reports/
multiqc ./08012023/trimmed_fastqc_reports/ --filename trimmed_multiqc_report.html --outdir ./08012023/multiqc_reports/
step1d_subsampling.sh
#!/bin/bash

# Subsample trimmed fastq files to 25000000 for each run
### 12122022
/usr/bin/time seqtk sample -s12282022 ./12122022/trimmed_fastq_files/tr_wtDNA_S17_L003_R1_001.fastq 25000000 | gzip > ./12122022/subsampled_fastq_files/wtDNA_seed12282022_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s12282022 ./12122022/trimmed_fastq_files/tr_wtDNA_S17_L003_R2_001.fastq 25000000 | gzip > ./12122022/subsampled_fastq_files/wtDNA_seed12282022_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s12282022 ./12122022/trimmed_fastq_files/tr_mutDNA_lib1_S18_L003_R1_001.fastq 25000000 | gzip > ./12122022/subsampled_fastq_files/mutDNA_lib1_seed12282022_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s12282022 ./12122022/trimmed_fastq_files/tr_mutDNA_lib1_S18_L003_R2_001.fastq 25000000 | gzip > ./12122022/subsampled_fastq_files/mutDNA_lib1_seed12282022_subsampled_25000000_R2.fastq.gz

### 06132023
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_wtDNA_S25_L002_R1_001.fastq 25000000 | gzip > ./06132023/subsampled_fastq_files/wtDNA_seed06212023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_wtDNA_S25_L002_R2_001.fastq 25000000 | gzip > ./06132023/subsampled_fastq_files/wtDNA_seed06212023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p1_S26_L002_R1_001.fastq 25000000 | gzip > ./06132023/subsampled_fastq_files/mutVirus_p1_seed06212023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p1_S26_L002_R2_001.fastq 25000000 | gzip > ./06132023/subsampled_fastq_files/mutVirus_p1_seed06212023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p2_S27_L002_R1_001.fastq 25000000 | gzip > ~./06132023/subsampled_fastq_files/mutVirus_p2_seed06212023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s06212023 ./06132023/trimmed_fastq_files/tr_mutVirus_lib1_p2_S27_L002_R2_001.fastq 25000000 | gzip > ./06132023/subsampled_fastq_files/mutVirus_p2_seed06212023_subsampled_25000000_R2.fastq.gz

### 08012023
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_wtDNA_S73_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/wtDNA_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_1_1_S80_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK11_1_1_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_1_1_S80_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK11_1_1_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_3_1_S81_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK11_3_1_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK11_3_1_S81_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK11_3_1_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_1_2_S76_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK152_1_2_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_1_2_S76_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK152_1_2_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_2_1_S77_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK152_2_1_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK152_2_1_S77_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK152_2_1_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_1_2_S78_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK265_1_2_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_1_2_S78_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK265_1_2_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_2_3_S79_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK265_2_3_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_CHK265_2_3_S79_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_CHK265_2_3_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_only_1_2_S74_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_only_1_2_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_only_1_2_S74_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_only_1_2_seed08122023_subsampled_25000000_R2.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_only_2_1_S75_L003_R1_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_only_2_1_seed08122023_subsampled_25000000_R1.fastq.gz
/usr/bin/time seqtk sample -s08122023 ./08012023/trimmed_fastq_files/tr_mutVirus_only_2_1_S75_L003_R2_001.fastq 25000000 | gzip > ./08012023/subsampled_fastq_files/mutVirus_only_2_1_seed08122023_subsampled_25000000_R2.fastq.gz
step1e_subsampled_qc.sh
#!/bin/bash

# Recheck quality of subsampled reads for each run
~/software/FastQC/fastqc --outdir ./12222022/subsampled_fastqc_reports/ --threads 6 ./12222022/subsampled_fastqc_files/*.fastq.gz
~/software/FastQC/fastqc --outdir ./06132023/subsampled_fastqc_reports/ --threads 6 ./06132023/subsampled_fastqc_files/*.fastq.gz
~/software/FastQC/fastqc --outdir ./08012023/subsampled_fastqc_reports/ --threads 9 ./08012023/subsampled_fastqc_files/*.fastq.gz

# Combine subsampled/trimmed fastqc files into multiqc report for each run
multiqc ./12122022/subsampled_fastqc_reports/ --filename trimmed_subsampled_multiqc_report.html --outdir ./12122022/multiqc_reports/
multiqc ./06132023/subsampled_fastqc_reports/ --filename trimmed_subsampled_multiqc_report.html --outdir ./06132023/multiqc_reports/
multiqc ./08012023/subsampled_fastqc_reports/ --filename trimmed_subsampled_multiqc_report.html --outdir ./08012023/multiqc_reports/

5.2 VirVarSeq

Alignment of subsampled clean reads to the WT mutagenized CHIKV p62 fragment sequence and codon metric calling with VirVarSeq software.

Software was downloaded from SourceForge and the relevant manuscript can be found here.

12-12-2022

Samples prefix file input for VirVarSeq:

samples.txt

                                             
 wtDNA_seed12282022_subsampled_25000000      
 mutDNA_lib1_seed12282022_subsampled_25000000


To execute VirVarSeq, the following shell script was run:

Expand to View Code Block
step2_run12292022.sh
#!/bin/bash

export R_LIBS_USER=./12292022/VirVarSeq/R/lib
export PERL5LIB=./12292022/VirVarSeq/lib
indir=./12292022/VirVarSeq/fastq
outdir=./12292022/VirVarSeq/results
samples=./12292022/VirVarSeq/samples.txt
ref=../ref/pCHIKV_AF15561.fasta
startpos=9821
endpos=15701
region_start=9821
region_len=1218
qv=0

./map_vs_ref.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --mapping paired > VirVarSeq.log 2>&1
./consensus.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --start $startpos --end $endpos >> VirVarSeq.log 2>&1
./map_vs_consensus.pl --samplelist $samples --indir $indir --outdir $outdir --mapping paired >> VirVarSeq.log 2>&1
./codon_table.pl --samplelist $samples --ref $ref --outdir $outdir --start=$region_start --len=$region_len --trimming=0 --qual=$qv >> VirVarSeq.log 2>&1
./mixture_model.pl --samplelist $samples --outdir $outdir --ref $ref --start=$region_start --len=$region_len --qual=$qv >> VirVarSeq.log 2>&1 

06-13-2023

Samples prefix file input for VirVarSeq:

samples.txt

                                             
 wtDNA_seed06212023_subsampled_25000000      
 mutVirus_p1_seed06212023_subsampled_25000000
 mutVirus_p2_seed06212023_subsampled_25000000


To execute VirVarSeq, the following shell script was run:

Expand to View Code Block
step2_run06132023.sh
#!/bin/bash

export R_LIBS_USER=./06132023/VirVarSeq/R/lib
export PERL5LIB=./06132023/VirVarSeq/lib
indir=./06132023/VirVarSeq/subsampled_fastq_files
outdir=./06132023/VirVarSeq/results
samples=./06132023/VirVarSeq/samples.txt
ref=../ref/pCHIKV_AF15561.fasta
startpos=9821
endpos=15701
region_start=9821
region_len=1218
qv=0

/usr/bin/time ./map_vs_ref.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --mapping paired > VirVarSeq.log 2>&1
/usr/bin/time ./consensus.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --start $startpos --end $endpos >> VirVarSeq.log 2>&1
/usr/bin/time ./map_vs_consensus.pl --samplelist $samples --indir $indir --outdir $outdir --mapping paired >> VirVarSeq.log 2>&1
/usr/bin/time ./codon_table.pl --samplelist $samples --ref $ref --outdir $outdir --start=$region_start --len=$region_len --trimming=0 --qual=$qv >> VirVarSeq.log 2>&1
/usr/bin/time ./mixture_model.pl --samplelist $samples --outdir $outdir --ref $ref --start=$region_start --len=$region_len --qual=$qv >> VirVarSeq.log 2>&1 

08-01-2023

Samples prefix file input for VirVarSeq:

samples.txt

                                                     
 wtDNA_seed08122023_subsampled_25000000              
 mutVirus_CHK11_1_1_seed08122023_subsampled_25000000 
 mutVirus_CHK11_3_1_seed08122023_subsampled_25000000 
 mutVirus_CHK152_1_2_seed08122023_subsampled_25000000
 mutVirus_CHK152_2_1_seed08122023_subsampled_25000000
 mutVirus_CHK265_1_2_seed08122023_subsampled_25000000
 mutVirus_CHK265_2_3_seed08122023_subsampled_25000000
 mutVirus_only_1_2_seed08122023_subsampled_25000000  
 mutVirus_only_2_1_seed08122023_subsampled_25000000  


To execute VirVarSeq, the following shell script was run:

Expand to View Code Block
step2_run08132023.sh
#!/bin/bash

export R_LIBS_USER=./08012023/VirVarSeq/R/lib
export PERL5LIB=./08012023/VirVarSeq/lib
indir=./08012023/subsampled_fastq_files
outdir=./08012023/VirVarSeq/results
samples=./08012023/VirVarSeq/samples.txt
ref=../ref/pCHIKV_AF15561.fasta
startpos=9821
endpos=15701
region_start=9821
region_len=1218
qv=0

/usr/bin/time ./map_vs_ref.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --mapping paired > VirVarSeq.log 2>&1
/usr/bin/time ./consensus.pl --samplelist $samples --ref $ref --indir $indir --outdir $outdir --start $startpos --end $endpos >> VirVarSeq.log 2>&1
/usr/bin/time ./map_vs_consensus.pl --samplelist $samples --indir $indir --outdir $outdir --mapping paired >> VirVarSeq.log 2>&1
/usr/bin/time ./codon_table.pl --samplelist $samples --ref $ref --outdir $outdir --start=$region_start --len=$region_len --trimming=0 --qual=$qv >> VirVarSeq.log 2>&1
/usr/bin/time ./mixture_model.pl --samplelist $samples --outdir $outdir --ref $ref --start=$region_start --len=$region_len --qual=$qv >> VirVarSeq.log 2>&1 

5.3 \(ndet\)

Creation of lineplots to visualize mutagenesis efficiency and selection results for virus library generation.

5.3.1 Calculation

To calculate the level of diversity tolerated for all iterations of the CHIKV-p62-DMS virus libraries, the number of detected amino acids (\(ndet\)) was determined using the following in-house R code.

In brief we first filtered the VirVarSeq output codon matrices with the following conditions:

  1. Must have greater than 100 counts for that codon-position pair
  2. Must have an average forward and reverse read minimum Phred quality score of 24.

Then, the remaining unique residues were summed for each position (\(ndet\)).

Warning

The code process below is heavily R dependent and not well suited for large batches of samples. Moving forward, pipeline will be converted to a loop-based algorithm and executed with remote compute nodes.

Expand to View Code Block
step3a_diversity_analysis.R
#!/usr/bin/Rscript

# GOAL: Count matrix processing for 'ndet' calculations
# Packages to load
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(stringr)

# Set working directory to find VirVarSeq output matrices
setwd("./06132023/VirVarSeq/results/mixture_model/")

# List file names but then need to open in excel and save as CSV file
list.files(path = ".", pattern = ".codon")

# Confirm all files successfully converted to .csv
list.files(path = ".", pattern = "Q.10.csv")

# Read in CSV files from VirVarSeq
df1 <- read.csv("wtDNA_seed06212023_subsampled_25000000.Q.10.csv")
df2 <- read.csv("mutVirus_p1_seed06212023_subsampled_25000000.Q.10.csv")
df3 <- read.csv("mutVirus_p2_seed06212023_subsampled_25000000.Q.10.csv")
df4 <- read.csv("mutDNA/mutDNA_r1_standard_prep_seed12282022_subsampled_25000000.Q.10.csv")

# Merge position and AA into single field in order to be able to sort/combine/melt/etc.
df1$MERGE <- paste(df1$POSITION, df1$CODON)
df2$MERGE <- paste(df2$POSITION, df2$CODON)
df3$MERGE <- paste(df3$POSITION, df3$CODON)
df4$MERGE <- paste(df4$POSITION, df4$CODON)

# Subset out only the mutated regions
df1_subset <- df1 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df2_subset <- df2 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df3_subset <- df3 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df4_subset <- df4 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

# Check which sample set has the most unique codons (aka: rows)
print(dim(df1_subset))
print(dim(df2_subset))
print(dim(df3_subset))
print(dim(df4_subset))

# Adding column name identifiers to prepare for a dataset merge
colnames(df1_subset) <- paste(colnames(df1_subset), "df1", sep = "_")
colnames(df2_subset) <- paste(colnames(df2_subset), "df2", sep = "_")
colnames(df3_subset) <- paste(colnames(df3_subset), "df3", sep = "_")
colnames(df4_subset) <- paste(colnames(df4_subset), "df4", sep = "_")

# Renaming MERGE_df# column name to be merge-able across all data sets in the sheet via same column name
colnames(df1_subset) [22] <- "MERGE"
colnames(df2_subset) [22] <- "MERGE"
colnames(df3_subset) [22] <- "MERGE"
colnames(df4_subset) [22] <- "MERGE"

# Create mergable field with position and corresponding amino acid (mutant and REF) to remove codon redundancy
df1_subset$MERGE_AA <- paste(df1_subset$POSITION_df1, df1_subset$AA_df1)
df2_subset$MERGE_AA <- paste(df2_subset$POSITION_df2, df2_subset$AA_df2)
df3_subset$MERGE_AA <- paste(df3_subset$POSITION_df3, df3_subset$AA_df3)
df4_subset$MERGE_AA <- paste(df4_subset$POSITION_df4, df4_subset$AA_df4)

# Count and quality filtering
filtered_df1 <- df1_subset %>%
  filter(CNT_df1 > 100) %>%
  filter(FWD_MEAN_MIN_QUAL_df1 >= 24) %>%
  filter(REV_MEAN_MIN_QUAL_df1 >= 24)

filtered_df2 <- df2_subset %>%
  filter(CNT_df2 > 100) %>%
  filter(FWD_MEAN_MIN_QUAL_df2 >= 24) %>%
  filter(REV_MEAN_MIN_QUAL_df2 >= 24)

filtered_df3 <- df3_subset %>%
  filter(CNT_df3 > 100) %>%
  filter(FWD_MEAN_MIN_QUAL_df3 >= 24) %>%
  filter(REV_MEAN_MIN_QUAL_df3 >= 24)

filtered_df4 <- df4_subset %>%
  filter(CNT_df4 > 100) %>%
  filter(FWD_MEAN_MIN_QUAL_df4 >= 24) %>%
  filter(REV_MEAN_MIN_QUAL_df4 >= 24)

# Calculate ndet per position with filtering metrics applied
filtered_df1 <- filtered_df1 %>%
  group_by(POSITION_df1) %>%
  mutate(ndet = length(unique(MERGE_AA)))

filtered_df2 <- filtered_df2 %>%
  group_by(POSITION_df2) %>%
  mutate(ndet = length(unique(MERGE_AA)))

filtered_df3 <- filtered_df3 %>%
  group_by(POSITION_df3) %>%
  mutate(ndet = length(unique(MERGE_AA)))

filtered_df4 <- filtered_df4 %>%
  group_by(POSITION_df4) %>%
  mutate(ndet = length(unique(MERGE_AA)))

# Determination of avg. # of AAs per position based on filtering including WT
print(mean(filtered_df1$ndet))
print(mean(filtered_df2$ndet))
print(mean(filtered_df3$ndet))
print(mean(filtered_df4$ndet))

# Adjusting the POSITION field
filtered_df1 <- filtered_df1 %>%
  mutate(POSITION = POSITION_df1 - 8)

filtered_df2<- filtered_df2 %>%
  mutate(POSITION = POSITION_df2 - 8)

filtered_df3 <- filtered_df3 %>%
  mutate(POSITION = POSITION_df3 - 8)

filtered_df4 <- filtered_df4 %>%
  mutate(POSITION = POSITION_df4 - 8)

# Add MERGE_FRAC for megalogo
filtered_df1$MERGE_FRAC <- 1/filtered_df1$ndet
filtered_df2$MERGE_FRAC <- 1/filtered_df2$ndet
filtered_df3$MERGE_FRAC <- 1/filtered_df3$ndet
filtered_df4$MERGE_FRAC <- 1/filtered_df4$ndet

# Write to CSV
write.csv(filtered_df1, "./ndet/df1_ndet.csv", row.names = FALSE)
write.csv(filtered_df2, "./ndet/df2_ndet.csv", row.names = FALSE)
write.csv(filtered_df3, "./ndet/df3_ndet.csv", row.names = FALSE)
write.csv(filtered_df4, "./ndet/df4_ndet.csv", row.names = FALSE)

# Remove duplicate MERGE_AA rows for the purpose of megalogo plotting, keeping highest CNT row
filtered_df1_mega <- filtered_df1 %>%
  group_by(MERGE_AA) %>%
  slice_max(order_by = CNT_df1, with_ties = FALSE) %>%
  ungroup()

filtered_df2_mega <- filtered_df2 %>%
  group_by(MERGE_AA) %>%
  slice_max(order_by = CNT_df2, with_ties = FALSE) %>%
  ungroup()

filtered_df3_mega <- filtered_df3 %>%
  group_by(MERGE_AA) %>%
  slice_max(order_by = CNT_df3, with_ties = FALSE) %>%
  ungroup()

filtered_df4_mega <- filtered_df4 %>%
  group_by(MERGE_AA) %>%
  slice_max(order_by = CNT_df4, with_ties = FALSE) %>%
  ungroup()


# Rename POSITION_df# column
filtered_df1_mega <- filtered_df1_mega %>%
  rename(POSITION_p62 = POSITION)

filtered_df2_mega <- filtered_df2_mega %>%
  rename(POSITION_p62 = POSITION)

filtered_df3_mega <- filtered_df3_mega %>%
  rename(POSITION_p62 = POSITION)

filtered_df4_mega <- filtered_df4_mega %>%
  rename(POSITION_p62 = POSITION)

# Remove suffixes for megalogo
names(filtered_df1_mega) <- gsub("_df1$", "", names(filtered_df1_mega))
names(filtered_df2_mega) <- gsub("_df2$", "", names(filtered_df2_mega))
names(filtered_df3_mega) <- gsub("_df3$", "", names(filtered_df3_mega))
names(filtered_df4_mega) <- gsub("_df4$", "", names(filtered_df4_mega))

# Confirm removal was successful
colnames(filtered_df1_mega)
colnames(filtered_df2_mega)
colnames(filtered_df3_mega)
colnames(filtered_df4_mega)
colnames(filtered_df4_mega)

# Sort on position
filtered_df1_mega <- filtered_df1_mega[order(filtered_df1_mega$POSITION_p62), ]
filtered_df2_mega <- filtered_df2_mega[order(filtered_df2_mega$POSITION_p62), ]
filtered_df3_mega <- filtered_df3_mega[order(filtered_df3_mega$POSITION_p62), ]
filtered_df4_mega <- filtered_df4_mega[order(filtered_df4_mega$POSITION_p62), ]

# Write to CSV
write.csv(filtered_df1_mega, "./megalogo/filtered_wtDNA_mega.csv", row.names = FALSE)
write.csv(filtered_df2_mega, "./megalogo/filtered_mutVirus_p1_mega.csv", row.names = FALSE)
write.csv(filtered_df3_mega, "./megalogo/filtered_mutVirus_p2_mega.csv", row.names = FALSE)
write.csv(filtered_df4_mega, "./megalogo/filtered_mutDNA_mega.csv", row.names = FALSE)

5.3.2 Visualization

The following R code was used to generate line plots displaying \(ndet\) diversity at each position.

Expand to View Code Block
step3b_diversity_lineplots.R
#!/usr/bin/Rscript

# GOAL: Generate diversity line plots for each virus library iteration
# Packages to load
library(ggplot2)
library(reshape2)
library(ggprism)
library(gridExtra)
library(grid)

# Read back in filtered CSV files
filtered_df1_mega <- read.csv("./megalogo/filtered_wtDNA_mega.csv")
filtered_df2_mega <- read.csv("./megalogo/filtered_mutVirus_p1_mega.csv")
filtered_df3_mega <- read.csv("./megalogo/filtered_mutVirus_p2_mega.csv")
filtered_df4_mega <- read.csv("./megalogo/filtered_mutDNA_mega.csv")

# Plotting 'ndet' for each position for all virus library iterations
wt <- ggplot(filtered_df1_mega, aes(POSITION_p62, ndet)) +
  geom_line(color = "#FC5F60") +
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("wtDNA") +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", y = "AA detected per pos. (ndet)") + 
  theme(axis.title.y = element_text(size = 11))

mutVirusp1 <- ggplot(filtered_df2_mega, aes(POSITION_p62, ndet)) +
  geom_line(color = "#B44385") +
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("mutVirus.p1") +
  scale_color_prism() +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", y = "AA detected per pos. (ndet)") + 
  theme(axis.title.y = element_text(size = 11))

mutVirusp2 <- ggplot(filtered_df3_mega, aes(POSITION_p62, ndet)) +
  geom_line(color = "#164A62") +
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("mutVirus.p2") +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", y = "AA detected per pos. (ndet)") + 
  theme(axis.title.y = element_text(size = 11))

mutDNA <- ggplot(filtered_df4_mega, aes(POSITION_p62, ndet)) +
  geom_line(color = "#5D578F") +
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("mutDNA") +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", y = "AA detected per pos. (ndet)") + 
  theme(axis.title.y = element_text(size = 11))

# Print and save each 'ndet' plot
print(wt)
ggsave("./ndet/wtDNA.png", plot = wt, device = "png", height = 4)

print(mutDNA)
ggsave("./ndet/mutDNA.png", plot = mutDNA, device = "png", height = 4)

print(mutVirusp1)
ggsave("./ndet/mutVirusp1.png", plot = mutVirusp1, device = "png", height = 4)

print(mutVirusp2)
ggsave("./ndet/mutVirusp2.png", plot = mutVirusp2, device = "png", height = 4)

# Combine into a grid plot
grid <- grid.arrange(wt, mutVirusp1, mutVirusp2, mutDNA, ncol = 2, top = textGrob("Unique AA per codon position", gp = gpar(fontsize = 14, fontface = "bold")))
grid

# Export grid plot at given dpi and picture format
ggsave("./ndet/gridplot.png", plot = grid, device = "png", dpi = 1200, width = 14, height = 10)
(a) mutVirus.p2
(b) mutDNA
(c) mutVirus.p1
(d) wtDNA
Figure 1: \(ndet\) Lineplots

5.3.3 mutDNA Replicate Comparison

An independent replicate of the mutDNA plasmid library was generated to check the rigor of the library mutagenesis. To compare the mutagenesis efficiency of replicate 1 and replicate 2, the following R analyses were performed.

Expand to View Code Block
library_correlates.R
#!/usr/bin/Rscript

# GOAL: Generate comparison plots for virus library replicates
# Packages to load
library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(reshape2)
library(ggprism)
library(gridExtra)
library(grid)
library(cowplot)

# Read in CSV file from VirVarSeq for rep2
df5 <- read.csv("mutDNA/mutDNA_r2_standard_prep_seed10252022_subSampled_25000000.Q.10.csv")

# Merge position and AA into single field
df5$MERGE <- paste(df5$POSITION, df5$CODON)

# Subset out only the mutated regions
df5_subset <- df5 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

# Create mergable field on AA in adition to codon
df5_subset$MERGE_AA <- paste(df5_subset$POSITION, df5_subset$AA)

# Count and quality filtering
filtered_df5 <- df5_subset %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

# Calculate ndet per position after filtering
filtered_df5 <- filtered_df5 %>%
  group_by(POSITION) %>%
  mutate(ndet = length(unique(MERGE_AA)))

# Determine avg. # of AAs per position based on filtering
print(mean(filtered_df5$ndet))

# Adjust the POSITION field
filtered_df5 <- filtered_df5 %>%
  mutate(POSITION_p62 = POSITION - 8)

# Add MERGE_FRAC for megalogo
filtered_df5$MERGE_FRAC <- 1 / filtered_df5$ndet

# Write to CSV
write.csv(filtered_df5, "ndet/df5_ndet.csv", row.names = FALSE)

# Remove duplicate MERGE_AA rows for megalogo plotting purposes
filtered_df5_mega <- filtered_df5 %>%
  group_by(MERGE_AA) %>%
  slice_max(order_by = CNT, with_ties = FALSE) %>%
  ungroup()

# Sort on position
filtered_df5_mega <- filtered_df5_mega[order(filtered_df5_mega$POSITION_p62), ]

# Write to CSV
write.csv(filtered_df5_mega, "./megalogo/filtered_df5_mega.csv", row.names = FALSE)

# Plotting 'ndet' for each position for library rep2
mutDNA <- ggplot(filtered_df4_mega, aes(POSITION_p62, ndet)) +
  geom_line(color = "#5D578F") +
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("mutDNA") +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", y = "AA detected per pos. (ndet)") + 
  theme(axis.title.y = element_text(size = 11))

print(plot5)
ggsave("ndet/df5.png", plot = plot5, device = "png", height = 4)

# Subset dataframes of mutDNA rep1 (df3 from previous code block) and rep2
filtered_df3_subset <- filtered_df3 %>% 
  select(SAMPLE, POSITION, POSITION_p62, REF_CODON, REF_AA, ndet) %>%
  unique()

filtered_df5_subset <- filtered_df5 %>% 
  select(SAMPLE, POSITION, POSITION_p62, REF_CODON, REF_AA, ndet) %>%
  unique()

# Add column suffixes for merging
colnames(filtered_df3_subset) <- paste(colnames(filtered_df3_subset), "df3", sep = "_")

colnames(filtered_df5_subset) <- paste(colnames(filtered_df5_subset), "df5", sep = "_")

# Merge data frames
combined_wide_abb <- bind_cols(filtered_df3_subset, filtered_df5_subset)

# Calculate differences in ndet for each position of rep1 (df3) and rep2 (df5)
combined_wide_abb$residual <- combined_wide_abb$ndet_df3 - combined_wide_abb$ndet_df5


combined_wide_abb <- combined_wide_abb %>%
  mutate(residual_sample = case_when(
    residual > 0  ~ "rep1",
    residual < 0  ~ "rep2",
    residual == 0 ~ NA
  ))

combined_wide_abb <- combined_wide_abb %>%
  mutate_at(c('residual'), ~na_if(., 0))

# Remove suffixes for long df
names(filtered_df3_subset) <- gsub("_df3$", "", names(filtered_df3_subset))
names(filtered_df5_subset) <- gsub("_df5$", "", names(filtered_df5_subset))

combined_long_abb <- bind_rows(filtered_df3_subset, filtered_df5_subset)


# Plot overlapping lineplot of rep1 and rep2
lineplot <- ggplot(combined_long_abb, aes(x = POSITION_p62, y = ndet, color = SAMPLE)) +
  geom_line(alpha = 0.9, size = 0.5, linetype = "solid") +
  scale_color_manual(values = c("#5D578F", "#0496FF"), 
                     labels = c("rep1", "rep2"), 
                     name = "Replicate") +
  scale_y_continuous(limits = c(0, 21),
                     breaks = seq(0, 20, by = 5)) +
  scale_x_continuous(limits = c(0, 400)) +
  labs(title = "mutDNA",
       x = NULL,
       y = "AA detected per pos. (ndet)") +
  theme_prism(base_size = 12) + 
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 12),  
        plot.title = element_text(size = 16, face = "bold"), 
        legend.title = element_text(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

print(lineplot)
ggsave("ndet/lineplot.png", plot = lineplot, height = 3, width = 7, dpi = 600)

# Plot differences in rep1 and rep2 via positive/negative lollipop graph
residuals <- ggplot(combined_wide_abb, aes(x = POSITION_p62_df5, y = residual, fill = residual_sample)) +
  annotate("rect", xmin = 0, xmax = 400, ymin = 0, ymax = 10, alpha = 0.1, fill = "#5D578F") +
  annotate("rect", xmin = 0, xmax = 400, ymin = -10, ymax = 0, alpha = 0.1, fill = "#0496FF") +
  geom_col(aes(fill = residual_sample), width = 0.5, na.rm = TRUE) +
  geom_point(alpha = 1, shape = 21, fill = "white", color = "#3C3C3C", size = 1, show.legend = TRUE, na.rm = TRUE) +
  scale_y_continuous(limits = c(-10, 10),
                     breaks = seq(-10, 10, by = 10)) +
  scale_fill_manual(values = c("rep1" = "#5D578F", "rep2"= "#0496FF"), 
                     labels = c("rep1" = "\U2191 rep1", "rep2" = "\U2191 rep2"), 
                     name = "Replicate", 
                     breaks = c("rep1", "rep2"),
                     na.translate = FALSE) +
  scale_color_manual(values = c("rep1" = "#5D578F", "rep2"= "#0496FF"), 
                     labels = c("rep1" = "rep1", "rep2" = "rep2"), 
                     name = "Replicate", 
                     breaks = c("rep1", "rep2"),
                     na.translate = FALSE) +
  geom_hline(yintercept = 0) +
  labs(title = NULL,
       x = "Position",
       y = "+/- (ndet)") +
  theme_prism(base_size = 12) +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 12),  
        plot.title = element_text(size = 16, face = "bold"), 
        legend.title = element_text(), 
        legend.text = element_text(family = "arial"),
        axis.ticks = element_line()
  )

print(residuals)
ggsave("ndet/residuals.png", plot = residuals, height = 2, width = 7)

# Combine into single figure
replicates_grid <- plot_grid(lineplot, residuals, align = "v", nrow = 2, rel_heights = c(3, 2))
print(grid)

ggsave("ndet/replicates_grid.png", plot = grid, width = 7)
(a) \(ndet\) Lineplots
(b) \(ndet\) Residuals
Figure 2: mutDNA Replicate Comparison of \(ndet\)

5.4 \(neff\)

Comparing the \(ndet\) metric with a traditional DMS metric, \(neff\).

5.4.1 Calculation

To compare the results of our new metric, \(ndet\), with a traditional metric of mutational tolerance referred to as \(neff\), or number of effective AAs (also referred to as “preferences”), the following script was run to calculate the preference of mutVirus.p2 when compared with the baseline mutDNA library.

Because these samples were sequenced on different runs, each sample was independently error corrected with their run’s corresponding wtDNA sequencing control.

step3c_prefs.sh
#!/bin/bash

dms2_prefs --chartype codon_to_aa --excludestop yes --pre mutDNA_wide_sum.txt --post mutVirusp2_wide_sum.txt --name p2errorvsDNA --err wtDNA_DNArun_wide_sum.txt  wtDNA_p2run_wide_sum.txt --ncpus 4

5.4.2 Visualization

To visually compare the \(neff\) metric for mutVirus.p2 with the corresponding \(ndet\) lineplot, the following R script was run.

Expand to View Code Block
neff_comparison.R
#!/usr/bin/Rscript

# GOAL: To plot neff for mutVirus.p2 over mutDNA (minus wtDNA) for comparison with the corresponding ndet plot.
# Packages to load
library(ggplot2)

# Load in prefs file from dms_tools2 
prefs <- read.csv("~/p2vsDNAerror_prefs_entropy.csv")

# Plot lineplot using ggplot2
lineplot_neff <- ggplot(prefs, aes(x = POSITION, y = neff)) +
  geom_line(color = "red") + 
  xlim(0,400) +
  ylim(0,21) +
  ggtitle("mutVirus.p2") +
  theme_prism(base_size = 11, base_line_size = 0.5) + 
  labs(x = "Position", 
       y = "AA effective per pos. (neff)") + 
  theme(axis.title.y = element_text(size = 11))

print(lineplot_neff)

ggsave("neff.png", plot = lineplot_neff, device = "png", height = 3, width = 5, dpi = 600)
Figure 3: mutVirus.p2 (\(neff\))

5.6 batchdiffsel

Calculating differential selection for monoclonal antibody escape assays.

5.6.1 Formatting

Because the output format for VirVarSeq does not meet the requirements to calculate differential selection, in-house R code (below) was utilized to format per the dms_tools2 requirements.

Warning

The code process below is heavily R dependent and not well suited for large batches of samples. Moving forward, pipeline will be converted to a loop-based algorithm and executed with remote compute nodes.

Expand to View Code Block
step5_dmstools_prep.R
#!/usr/bin/Rscript

# GOAL: To reformat the VirVarSeq output for differential selection calculation per dms_tools2 formatting requirements.
# Packages to load
library(plyr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(tidyr)
library(stringr)

# Set working directory to find VirVarSeq output matrices
setwd("./08012023/VirVarSeq/results/mixture_model/diffsel")

# List file names but then need to open in excel and save as .csv file
# List.files(path = ".", pattern = ".Q.10.codon")

# Confirm all files successfully converted to .csv
list.files(path = ".", pattern = ".Q.10.csv")

# Read in .csv files from VirVarSeq
df1 <- read.csv("wtDNA_seed08122023_subsampled_25000000.Q.10.csv")
df2 <- read.csv("mutVirus_p2_seed06212023_subsampled_25000000.Q.10.csv")
df3 <- read.csv("mutVirus_CHK11_1_1_seed08122023_subsampled_25000000.Q.10.csv")
df4 <- read.csv("mutVirus_CHK11_3_1_seed08122023_subsampled_25000000.Q.10.csv")
df5 <- read.csv("mutVirus_CHK152_1_2_seed08122023_subsampled_25000000.Q.10.csv")
df6 <- read.csv("mutVirus_CHK152_2_1_seed08122023_subsampled_25000000.Q.10.csv")
df7 <- read.csv("mutVirus_CHK265_1_2_seed08122023_subsampled_25000000.Q.10.csv")
df8 <- read.csv("mutVirus_CHK265_2_3_seed08122023_subsampled_25000000.Q.10.csv")
df9 <- read.csv("mutVirus_only_1_2_seed08122023_subsampled_25000000.Q.10.csv")
df10 <- read.csv("mutVirus_only_2_1_seed08122023_subsampled_25000000.Q.10.csv")

# Merge position and AA into single field in order to be able to sort/combine/melt/etc.
df1$MERGE <- paste(df1$POSITION, df1$AA)
df2$MERGE <- paste(df2$POSITION, df2$AA)
df3$MERGE <- paste(df3$POSITION, df3$AA)
df4$MERGE <- paste(df4$POSITION, df4$AA)
df5$MERGE <- paste(df5$POSITION, df5$AA)
df6$MERGE <- paste(df6$POSITION, df6$AA)
df7$MERGE <- paste(df7$POSITION, df7$AA)
df8$MERGE <- paste(df8$POSITION, df8$AA)
df9$MERGE <- paste(df9$POSITION, df9$AA)
df10$MERGE <- paste(df10$POSITION, df10$AA)

# Subset out only the mutated regions
df1_subset <- df1 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df2_subset <- df2 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df3_subset <- df3 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df4_subset <- df4 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df5_subset <- df5 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df6_subset <- df6 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df7_subset <- df7 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df8_subset <- df8 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df9_subset <- df9 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

df10_subset <- df10 %>%
  filter(POSITION <= 398) %>%
  filter(POSITION >= 9)

# Calculate the total reads for each amino acid, not codon, per position
df1_subset_sum <- df1_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df2_subset_sum <- df2_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df3_subset_sum <- df3_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df4_subset_sum <- df4_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df5_subset_sum <- df5_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df6_subset_sum <- df6_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df7_subset_sum <- df7_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df8_subset_sum <- df8_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df9_subset_sum <- df9_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

df10_subset_sum <- df10_subset %>%
  group_by(MERGE) %>%
  summarise(SUM = sum(CNT))

# Combine this field with the original dataframe
df1_subset_sum <- merge(df1_subset, df1_subset_sum, by = "MERGE", all = T)
df2_subset_sum <- merge(df2_subset, df2_subset_sum, by = "MERGE", all = T)
df3_subset_sum <- merge(df3_subset, df3_subset_sum, by = "MERGE", all = T)
df4_subset_sum <- merge(df4_subset, df4_subset_sum, by = "MERGE", all = T)
df5_subset_sum <- merge(df5_subset, df5_subset_sum, by = "MERGE", all = T)
df6_subset_sum <- merge(df6_subset, df6_subset_sum, by = "MERGE", all = T)
df7_subset_sum <- merge(df7_subset, df7_subset_sum, by = "MERGE", all = T)
df8_subset_sum <- merge(df8_subset, df8_subset_sum, by = "MERGE", all = T)
df9_subset_sum <- merge(df9_subset, df9_subset_sum, by = "MERGE", all = T)
df10_subset_sum <- merge(df10_subset, df10_subset_sum, by = "MERGE", all = T)

# Filter based on TB filter requirements
filtered_df1_subset_sum <- df1_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df2_subset_sum <- df2_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df3_subset_sum <- df3_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df4_subset_sum <- df4_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df5_subset_sum <- df5_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df6_subset_sum <- df6_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df7_subset_sum <- df7_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df8_subset_sum <- df8_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df9_subset_sum <- df9_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

filtered_df10_subset_sum <- df10_subset_sum %>%
  filter(CNT > 100) %>%
  filter(FWD_MEAN_MIN_QUAL >= 24) %>%
  filter(REV_MEAN_MIN_QUAL >= 24)

# Calculate the per amino acid position mutational frequency (not codon)
filtered_df1_subset_sum$grouped_freq <- filtered_df1_subset_sum$SUM/filtered_df1_subset_sum$DENOM
filtered_df2_subset_sum$grouped_freq <- filtered_df2_subset_sum$SUM/filtered_df2_subset_sum$DENOM
filtered_df3_subset_sum$grouped_freq <- filtered_df3_subset_sum$SUM/filtered_df3_subset_sum$DENOM
filtered_df4_subset_sum$grouped_freq <- filtered_df4_subset_sum$SUM/filtered_df4_subset_sum$DENOM
filtered_df5_subset_sum$grouped_freq <- filtered_df5_subset_sum$SUM/filtered_df5_subset_sum$DENOM
filtered_df6_subset_sum$grouped_freq <- filtered_df6_subset_sum$SUM/filtered_df6_subset_sum$DENOM
filtered_df7_subset_sum$grouped_freq <- filtered_df7_subset_sum$SUM/filtered_df7_subset_sum$DENOM
filtered_df8_subset_sum$grouped_freq <- filtered_df8_subset_sum$SUM/filtered_df8_subset_sum$DENOM
filtered_df9_subset_sum$grouped_freq <- filtered_df9_subset_sum$SUM/filtered_df9_subset_sum$DENOM
filtered_df10_subset_sum$grouped_freq <- filtered_df10_subset_sum$SUM/filtered_df10_subset_sum$DENOM


# Subset out only the relevant fields to prepare for dms_tools2
df1deduplicated <- filtered_df1_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df2deduplicated <- filtered_df2_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df3deduplicated <- filtered_df3_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df4deduplicated <- filtered_df4_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df5deduplicated <- filtered_df5_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df6deduplicated <- filtered_df6_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df7deduplicated <- filtered_df7_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df8deduplicated <- filtered_df8_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df9deduplicated <- filtered_df9_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]
df10deduplicated <- filtered_df10_subset_sum[,c("POSITION", "REF_CODON", "CODON", "CNT")]

# Subset only the distinct (deduplicated) rows
df1deduplicated <- distinct(df1deduplicated)
df2deduplicated <- distinct(df2deduplicated)
df3deduplicated <- distinct(df3deduplicated)
df4deduplicated <- distinct(df4deduplicated)
df5deduplicated <- distinct(df5deduplicated)
df6deduplicated <- distinct(df6deduplicated)
df7deduplicated <- distinct(df7deduplicated)
df8deduplicated <- distinct(df8deduplicated)
df9deduplicated <- distinct(df9deduplicated)
df10deduplicated <- distinct(df10deduplicated)

# Write to csv in current working directory)
write.csv(df1deduplicated, "df1_deduplicated.csv")
write.csv(df2deduplicated, "df2_deduplicated.csv")
write.csv(df3deduplicated, "df3_deduplicated.csv")
write.csv(df4deduplicated, "df4_deduplicated.csv")
write.csv(df5deduplicated, "df5_deduplicated.csv")
write.csv(df6deduplicated, "df6_deduplicated.csv")
write.csv(df7deduplicated, "df7_deduplicated.csv")
write.csv(df8deduplicated, "df8_deduplicated.csv")
write.csv(df9deduplicated, "df9_deduplicated.csv")
write.csv(df10deduplicated, "df10_deduplicated.csv")

# Convert to wide format
df1_wide_sum <- df1deduplicated %>%
  spread(CODON, CNT)
df2_wide_sum <- df2deduplicated %>%
  spread(CODON, CNT)
df3_wide_sum <- df3deduplicated %>%
  spread(CODON, CNT)
df4_wide_sum <- df4deduplicated %>%
  spread(CODON, CNT)
df5_wide_sum <- df5deduplicated %>%
  spread(CODON, CNT)
df6_wide_sum <- df6deduplicated %>%
  spread(CODON, CNT)
df7_wide_sum <- df7deduplicated %>%
  spread(CODON, CNT)
df8_wide_sum <- df8deduplicated %>%
  spread(CODON, CNT)
df9_wide_sum <- df9deduplicated %>%
  spread(CODON, CNT)
df10_wide_sum <- df10deduplicated %>%
  spread(CODON, CNT)

# Rename column names to be easily read by dms_tools2.prefs
colnames(df1_wide_sum)[1] = "site"
colnames(df2_wide_sum)[1] = "site"
colnames(df3_wide_sum)[1] = "site"
colnames(df4_wide_sum)[1] = "site"
colnames(df5_wide_sum)[1] = "site"
colnames(df6_wide_sum)[1] = "site"
colnames(df7_wide_sum)[1] = "site"
colnames(df8_wide_sum)[1] = "site"
colnames(df9_wide_sum)[1] = "site"
colnames(df10_wide_sum)[1] = "site"

colnames(df1_wide_sum)[2] = "wildtype" 
colnames(df2_wide_sum)[2] = "wildtype" 
colnames(df3_wide_sum)[2] = "wildtype" 
colnames(df4_wide_sum)[2] = "wildtype" 
colnames(df5_wide_sum)[2] = "wildtype" 
colnames(df6_wide_sum)[2] = "wildtype"
colnames(df7_wide_sum)[2] = "wildtype" 
colnames(df8_wide_sum)[2] = "wildtype"
colnames(df9_wide_sum)[2] = "wildtype" 
colnames(df10_wide_sum)[2] = "wildtype" 

# Remove NAs and replace with zeros
df1_wide_sum[is.na(df1_wide_sum)] <- 0
df2_wide_sum[is.na(df2_wide_sum)] <- 0
df3_wide_sum[is.na(df3_wide_sum)] <- 0
df4_wide_sum[is.na(df4_wide_sum)] <- 0
df5_wide_sum[is.na(df5_wide_sum)] <- 0
df6_wide_sum[is.na(df6_wide_sum)] <- 0
df7_wide_sum[is.na(df7_wide_sum)] <- 0
df8_wide_sum[is.na(df8_wide_sum)] <- 0
df9_wide_sum[is.na(df9_wide_sum)] <- 0
df10_wide_sum[is.na(df10_wide_sum)] <- 0

# Set site numbers to start at 1, not 9
df1_wide_sum$site <- c(1:390)
df2_wide_sum$site <- c(1:390)
df3_wide_sum$site <- c(1:390)
df4_wide_sum$site <- c(1:390)
df5_wide_sum$site <- c(1:390)
df6_wide_sum$site <- c(1:390)
df7_wide_sum$site <- c(1:390)
df8_wide_sum$site <- c(1:390)
df9_wide_sum$site <- c(1:390)
df10_wide_sum$site <- c(1:390)

# Write to csv in current working directory
write.csv(df1_wide_sum, "wtDNA_codoncounts.csv", row.names = FALSE)
write.csv(df2_wide_sum, "mutVirus-only-old_codoncounts.csv", row.names = FALSE)
write.csv(df3_wide_sum, "CHK-11-1_codoncounts.csv", row.names = FALSE)
write.csv(df4_wide_sum, "CHK-11-2_codoncounts.csv", row.names = FALSE)
write.csv(df5_wide_sum, "CHK-152-1_codoncounts.csv", row.names = FALSE)
write.csv(df6_wide_sum, "CHK-152-2_codoncounts.csv", row.names = FALSE)
write.csv(df7_wide_sum, "CHK-265-1_codoncounts.csv", row.names = FALSE)
write.csv(df8_wide_sum, "CHK-265-2_codoncounts.csv", row.names = FALSE)
write.csv(df9_wide_sum, "mutVirus-only-1_codoncounts.csv", row.names = FALSE)
write.csv(df10_wide_sum, "mutVirus-only-2_codoncounts.csv", row.names = FALSE)

# Write to txt in current working directory (actual input for dms_tools2)
write.csv(df1_wide_sum, "wtDNA_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df2_wide_sum, "mutVirus-only-old_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df3_wide_sum, "CHK-11-1_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df4_wide_sum, "CHK-11-2_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df5_wide_sum, "CHK-152-1_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df6_wide_sum, "CHK-152-2_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df7_wide_sum, "CHK-265-1_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df8_wide_sum, "CHK-265-2_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df9_wide_sum, "mutVirus-only-1_codoncounts.txt", sep = "\t", row.names = FALSE)
write.csv(df10_wide_sum, "mutVirus-only-2_codoncounts.txt", sep = "\t", row.names = FALSE)

5.6.2 Execution

Table 1: batchfile.csv
group name sel mock err
CHK-11 replicate-1-v1 CHK-11-1 mutVirus-only-1 wtDNA
CHK-11 replicate-2-v1 CHK-11-2 mutVirus-only-1 wtDNA
CHK-11 replicate-1-v2 CHK-11-1 mutVirus-only-2 wtDNA
CHK-11 replicate-2-v2 CHK-11-2 mutVirus-only-2 wtDNA
CHK-152 replicate-1-v1 CHK-152-1 mutVirus-only-1 wtDNA
CHK-152 replicate-2-v1 CHK-152-2 mutVirus-only-1 wtDNA
CHK-152 replicate-1-v2 CHK-152-1 mutVirus-only-2 wtDNA
CHK-152 replicate-2-v2 CHK-152-2 mutVirus-only-2 wtDNA
CHK-265 replicate-1-v1 CHK-265-1 mutVirus-only-1 wtDNA
CHK-265 replicate-2-v1 CHK-265-2 mutVirus-only-1 wtDNA
CHK-265 replicate-1-v2 CHK-265-1 mutVirus-only-2 wtDNA
CHK-265 replicate-2-v2 CHK-265-2 mutVirus-only-2 wtDNA


With the above batch file, the following code was executed to generate summary results for each replicate against all other replicates.

step6_dmstools_diffsel.sh
#!/bin/bash

dms2_batch_diffsel --outdir output --indir ./input/ --ncpus 10 --batchfile ./input/batchfile.csv --summaryprefix summary

5.8 dms-viz

Generating interactive plots to visualize mutational tolerance and antibody escape across CHIKV domains.

5.8.1 Formatting

Expand to View Code Block
step7_dms-viz_prep.R
#!/usr/bin/Rscript

# GOAL: Convert the 'ndet' metric for mutVirus.p2 to binary for tooltips heatmap 
# Packages to load
library(dplyr)

# Megalogo output already has long format for detected amino acids, so read in CSV for my reference df
df1 <- read.csv("./06132023/VirVarSeq/results/mixture_model/megalogo/filtered_mutVirus_p2_mega.csv")

# Read in both structures' existing dms-viz CSV files (PDB: 3N42, 3J2W)
df2 <- read.csv("./dms-viz/inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv")
df3 <- read.csv("./dms-viz/inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv")

# Create a new variable ('MERGE_AA_p62') that combines the p62 numbering with AA to reference in the binary output
df1$MERGE_AA_p62 <- paste(df1$POSITION_p62, df1$AA)
df2$MERGE_AA_p62 <- paste(df2$p62_site, df2$mutant)
df3$MERGE_AA_p62 <- paste(df3$site, df3$mutant)

# Create a new variable ('NDET_SITE') that reports a '1' if it was detected in mutVirus.p2 (per megalogo file), or '0' if not
df2 <- df2 %>%
  mutate(ndet_site = if_else(MERGE_AA_p62 %in% df1$MERGE_AA_p62, "1", "0"))

df3 <- df3 %>%
  mutate(ndet_site = if_else(MERGE_AA_p62 %in% df1$MERGE_AA_p62, "1", "0"))

# Write the new dataframe to CSV for dms-viz JSON creation
write.csv(df2, "./dms-viz/inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42_site.csv", row.names = FALSE)
write.csv(df3, "./dms-viz/inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w_site.csv", row.names = FALSE)

5.8.2 Visualization

Standard View
Expand to View Code Block
step8_dms-viz.sh
#!/bin/bash

conda activate dms_viz

configure-dms-viz format \
--name "CHIKV 'ndet' (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w_site.csv \
--metric "ndet_site" \
--metric-name "AA Detected" \
--condition "epitope" \
--condition-name "E2" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_ndet_3j2w.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,1 \
--floor True \
--tooltip-cols "{'mutdiffsel_11': 'CHK-11 Escape', 'mutdiffsel_152': 'CHK-152 Escape', 'mutdiffsel_265': 'CHK-265 Escape'}" \
--title "CHIKV 'ndet'" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHIKV 'ndet' (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42_site.csv \
--metric "ndet_site" \
--metric-name "AA Detected" \
--condition "epitope" \
--condition-name "p62" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_ndet_3n42.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,1 \
--floor True \
--tooltip-cols "{'mutdiffsel_11': 'CHK-11 Escape', 'mutdiffsel_152': 'CHK-152 Escape', 'mutdiffsel_265': 'CHK-265 Escape'}" \
--title "CHIKV 'ndet'" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-11 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_11" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "E2" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk11_3j2w.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-11 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-152 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_152" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "E2" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk152_3j2w.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-152 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-265 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_265" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "E2" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk265_3j2w.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-265 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-11 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_11" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "p62" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk11_3n42.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-11 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-152 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_152" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "p62" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk152_3n42.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-152 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-265 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_265" \
--metric-name "Diff Sel (log2)" \
--condition "epitope" \
--condition-name "p62" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk265_3n42.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-265 Escape" \
--summary-stat "sum"

configure-dms-viz join \
--input ./outputs/chikv_ndet_3j2w.json,./outputs/chikv_ndet_3n42.json,./outputs/chikv_chk11_3j2w.json,./outputs/chikv_chk152_3j2w.json,./outputs/chikv_chk265_3j2w.json,./outputs/chikv_chk11_3n42.json,./outputs/chikv_chk152_3n42.json,./outputs/chikv_chk265_3n42.json \
--output ./outputs/chikv_all.json \
--description ./viz-description.md


### E1 is R92 G92 B92
#### Protein Representation: Cartoon (default)
#### Peripheral Representation: Surface
#### Change Peripheral Color to R92 G92 B92
#### Select All Sites
#### Selection Representation: Surface
#### Download Protein Image

conda deactivate
Colored by Domain
Expand to View Code Block
step8_dms-viz-domain.sh
#!/bin/bash

conda activate dms_viz

configure-dms-viz format \
--name "CHIKV 'ndet' (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w_site.csv \
--metric "ndet_site" \
--metric-name "AA Detected" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/modified_3j2w.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_ndet_3j2w_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,1 \
--floor True \
--tooltip-cols "{'mutdiffsel_11': 'CHK-11 Escape', 'mutdiffsel_152': 'CHK-152 Escape', 'mutdiffsel_265': 'CHK-265 Escape'}" \
--title "CHIKV 'ndet'" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHIKV 'ndet' (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42_site.csv \
--metric "ndet_site" \
--metric-name "AA Detected" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_ndet_3n42_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,1 \
--floor True \
--tooltip-cols "{'mutdiffsel_11': 'CHK-11 Escape', 'mutdiffsel_152': 'CHK-152 Escape', 'mutdiffsel_265': 'CHK-265 Escape'}" \
--title "CHIKV 'ndet'" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-11 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_11" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk11_3j2w_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-11 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-152 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_152" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk152_3j2w_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-152 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-265 Escape (3J2W)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3j2w.csv \
--metric "mutdiffsel_265" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/modified_3J2W.pdb \
--sitemap ./inputs/sitemap_3j2w.csv \
--output ./outputs/chikv_chk265_3j2w_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "N O P" \
--excluded-chains "A E F G H I J K L M Q R S T" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-265 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-11 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_11" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk11_3n42_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-11 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-152 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_152" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk152_3n42_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-152 Escape" \
--summary-stat "sum"

configure-dms-viz format \
--name "CHK-265 Escape (3N42)" \
--input ./inputs/p2vsDNAerror_prefs_entropy_ndet_long_viz_3n42.csv \
--metric "mutdiffsel_265" \
--metric-name "Diff Sel (log2)" \
--condition "domain" \
--condition-name "Domain" \
--structure ./inputs/3n42.pdb \
--sitemap ./inputs/sitemap_3n42.csv \
--output ./outputs/chikv_chk265_3n42_domain.json \
--alphabet "ACDEFGHIKLMNPQRSTVWY" \
--included-chains "A B" \
--check-pdb TRUE \
--colors "#b00000,#b00000,#b00000,#b00000,#b00000,#b00000,#b00000" \
--heatmap-limits 0,13 \
--floor True \
--tooltip-cols "{'ndet': '# AA Detected'}" \
--title "CHK-265 Escape" \
--summary-stat "sum"

configure-dms-viz join \
--input ./outputs/chikv_ndet_3j2w_domain.json,./outputs/chikv_ndet_3n42_domain.json,./outputs/chikv_chk11_3j2w_domain.json,./outputs/chikv_chk152_3j2w_domain.json,./outputs/chikv_chk265_3j2w_domain.json,./outputs/chikv_chk11_3n42_domain.json,./outputs/chikv_chk152_3n42_domain.json,./outputs/chikv_chk265_3n42_domain.json \
--output ./outputs/chikv_all_domain.json \
--description ./viz-description.md

### E1 is R92 G92 B92
### E2 is R135 G145 B225 for 3n42
#### Protein Representation: Surface
##### Change Protein Color to R135 G145 B225 for 3n42
#### Peripheral Representation: Surface
#### Change Peripheral Color to R92 G92 B92
#### Select All Sites
##### Select E3 domain for 3n42
#### Selection Representation: Surface
#### Download Protein Image

conda deactivate

Citation

The following citation can be used to reference this Wiki document.

Citation

BibTeX citation:
@article{m._stumpf2025,
  author = {M. Stumpf, Megan and Brunetti, Tonya and J. Davenport,
    Bennett and K. McCarthy, Mary and E. Morrison, Thomas},
  title = {Deep Mutationally Scanned {(DMS)} {CHIKV} {E3/E2} Virus
    Library Maps Viral Amino Acid Preferences and Predicts Viral Escape
    Mutants of Neutralizing {CHIKV} Antibodies},
  journal = {J. Virol.},
  date = {2025-03-04},
  url = {https://www.biorxiv.org/content/10.1101/2024.12.04.626854v1.full},
  doi = {10.1101/2024.12.04.626854},
  langid = {en}
}
For attribution, please cite this work as:
M. Stumpf M, Brunetti T, J. Davenport B, K. McCarthy M, E. Morrison T. 2025. Deep mutationally scanned (DMS) CHIKV E3/E2 virus library maps viral amino acid preferences and predicts viral escape mutants of neutralizing CHIKV antibodies. J Virol https://doi.org/10.1101/2024.12.04.626854.